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1. INTRODUCTION 

I am grateful to all of the discussants for their 
comments which raise a number of important and in- 
sightful issues, and add significantly to the breadth 
of ideas. Following a few introductory comments on 
the need for a new regression genre that centers on 
dimension reduction, I turn to the discussants' re- 
marks. 

The development in the 1960s and early 1970s of 
diagnostic methods for regression produced a major 
shift in regression methodology. When a diagnostic 
produces compelling evidence of a deficiency in the 
current model or data it is natural to pursue re- 
medial action, leading to a new model and a new 
round of diagnostics, proceeding in this way until 
the required diagnostic checks are passed. By the 
late 1970s this type of iterative model development 
paradigm was widely represented in the applied sci- 
ences and was formalized in the statistical literature 
by Box (1979, 1980) and Cook and Weisberg (1982). 
With the availability of desktop computing starting 
in the mid-1980s, it is now possible to apply in rea- 
sonable time batteries of graphical and numerical 
diagnostics to many regressions. 

Advances in computing and other technologies now 
allow scientists to routinely formulate regressions in 
which the number p of predictors is considerably 
larger than that normally considered in the past. 
Such large-p regressions necessitate a new type of 
analysis for at least two reasons. First, the stan- 
dard iterative paradigm for model development can 
become untenable when -p is large. Recognizing the 
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variety of graphical diagnostics that could be used 
and the possibility of iteration, a thorough analy- 
sis might require assessment of many plots in ad- 
dition to various numerical diagnostics. Experience 
has shown that the paradigm can often become im- 
ponderable when applied with too many predictors. 
Second, in some regressions, particularly those asso- 
ciated with high-throughput technologies, the sam- 
ple size n may be smaller than p, leading to opera- 
tional problems in addition to ponderability difficul- 
ties. These issues have caused a shift in the applied 
sciences toward a different regression genre with the 
goal of reducing the dimensionality of the vector 
X G M*' of predictors as a first step in the analy- 
sis, effectively raising an old idea to a position of 
prominence. 

Today, dimension reduction is ubiquitous in the 
applied sciences, represented primarily by principal 
component methodology. Fifteen years ago I rarely 
encountered intra-university scientists seeking help 
with principal component reductions in regression. 
Such settings no longer seem unusual. The reasons 
for this are as indicated previously: While I occa- 
sionally see problems with n < p, more frequently 
n is several times p, while p itself is too large for 
a full commitment to iterative model development 
guided by diagnostics. This, in addition to the rea- 
sons stated in Section 2 of the article, leads me 
to conclude that the case for dimension reduction 
methodology has been made, methodology based on 
firm parametric foundations with subsequent robust 
and nonparametric counterparts. Whether the ideas 
and methodological directions I proposed will meet 
this goal is less clear, but I am still convinced that 
they hold promise when X and Y are jointly dis- 
tributed. 

In contrast to a comment by Christensen, I think 
parametric dimension reduction is currently as im- 
portant, if not more important, than other forms, 
partly because dimension reduction methodology has 
existed mostly in a world apart from core Fisherian 
theory, making it difficult to appreciate what could 
be achieved. For this reason I welcome Christensen's 
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development of connections with multivariate linear 
model theory. 

2. APPLICABILITY 

According to Christensen, a key issue in the devel- 
opment of models (2), (5), (10) and (13) is whether 
they are "broadly reasonable." I agree. Moreover, 
the emerging picture does seem to be one of broad 
reasonableness for the reasons indicated in the fol- 
lowing sections. 

2.1 First Applications 

As mentioned in Section 6.5 of the article, I have 
used the reductive model (13) on several standard 
data sets from the literature and nearly always have 
found d < p, and often substantially so. Addition- 
ally, I recently applied model (13) to a consulting 
problem with about 35 predictors and a little less 
than 300 observations. Again d<^p with an appar- 
ently good fit. 

The logo data analyzed by Li and Nachtsheim of- 
fer another opportunity to test the proposed method- 
ology. To complement their analysis I first used Grass- 
mann optimization to fit the reductive model (13) 
with d = I and the cubic option for fy , without pre- 
dictor screening. The log likelihood ratio statistic 
(cf. Section 6.5) for comparing this fit to the full 
model has the value Ai = 94.02 on r{p — 1) = 63 
degrees of freedom, for a nominal p-value of 0.007. 
The same process with other values of d gave A2 = 
66.76, A3 = 65.1 and A21 = 0.076 with correspond- 
ing p-values 0.26, 0.21 and 0.99. Consequently, I 
inferred that the sufficient reduction F^X is two- 
dimensional. This analysis provides another instance 
in support of the possibility that the proposed mod- 
els are broadly reasonable. Grassmann optimization 
is preferable to evaluating the log likelihood at var- 
ious combinations of the eigenvectors of S or Sfit , 
although the latter can sometimes provide a useful 
approximation. 

There are several ways in which an estimated suf- 
ficient reduction F-^X could be used to continue an 
analysis, depending on application-specific require- 
ments. As long as d is sufficiently small, as it is in the 
logo data, the standard model-diagnostic paradigm 
might be used to develop a forward model for the re- 
gression of Y on F-^X. In cases where a full forward 
model is not essential, we could proceed directly to 
estimation of the forward mean function based on 
the relation 

E{YN{x\Y)} 



where is the normal density of X|y and the ex- 
pectations are with respect to the marginal distri- 
bution of Y. This can be estimated using 



E(r|X = x) 



Er=ij/»A^(x|j/.) 
E^=l^(x|2/^) ' 



where denotes the estimated density obtained 
by substituting parameter estimates. The estimated 
mean function allows construction of residuals yi — 
E(y|X = Xj) that can be plotted to check the for- 
ward mean function implied by the inverse model. 
Here the residuals are used as a final diagnostic 
check, and not necessarily as an integral part of a 
model building process. 

2.2 Supervised Principal Components 

Li and Nachtsheim developed a fascinating con- 
nection between the reductive model (13) and the 
supervised principal component (SPG) setting of Bair 
et al. (2006). The latent variable model used by Bair 
et al. (2006, Section 2.2) and described by Li and 
Nachtsheim leads to the PFC model (5) or the re- 
ductive model (13), depending on the restrictions 
placed on the covariance matrix of the errors (e, ei, 
. . . ,€p). However, Bair et al. did not pose or use (5) 
or (13) as their basis for estimation, but instead used 
principal components. Thus, the method of SPC's is 
more in line with the PC model (2). Conformably 
partition 



X: 



Xi 
X2 



and F: 



ri 
r2 



E(y|x = x) 



E{N{^\Y)} 



and assume that F2 = 0. Then under the PC model 
(2), the sufficient reduction is just F^Xi and the 
maximum likelihood estimator of span(Fi) is the 
span of the first d eigenvectors of the sample version 
of Var(Xi). This leads to exactly the analysis sug- 
gested by Bair et al.: Use initial screening to identify 
X2 and then compute the first (or first few) principal 
components for Xi. I expect that PFC reductions 
based on model (5) or sufficient reductions based on 
model (13) can do much better than SPC's, partic- 
ularly if prior screening is based on an information 
criterion like AIC or BIC applied in the context of 
the inverse model. 

This expectation is supported by the logo data, 
since there is only a very weak relationship between 
the first two principal components and the estimated 
sufficient reduction based on model (13). For in- 
stance, the value of from the linear regression of 
the first principal component on F'^X is only 0.11. 
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The upper left plot in Li and Nachtsheim's Figure 
1 was computed using the first principal standard- 
ized component, the first principal component com- 
puted after standardizing each predictor to have a 
marginal sample standard deviation of 1 (personal 
communication) . The value of from the linear re- 
gression of the first principal standardized compo- 
nent on r-^X is 0.60. In this example the first princi- 
pal standardized component outperformed the first 
principal component, although this need not always 
be so (cf. Section 7.3). The general points in my dis- 
cussion of predictor standardization were captured 
nicely by Christensen's summary: Standardization 
is necessary, unless the regression can be described 
reasonably by the PC model (2) or the PFC model 
(5), but the common practice of marginal predictor 
scaling is not sufficient, and may even be counter- 
productive. 

On balance, SPC's are based on relatively weak 
supervision, since the response is used only in the 
screening phase. Models (5) and (13) allow more 
complete supervision, whether used with prior screen- 
ing or not. 

2.3 Variance Reduction versus Bias 

Consider a regression in which model (13) holds 
with d = p, which is equivalent to model (17), and 
that no version of model (13) holds strictly with 
d < p. In that case, fitting (13) with d < p will re- 
sult in bias along with a reduction in variance. It is 
conjectured that often the reduction in variance will 
outweigh the increase in bias, resulting in a reduc- 
tion in mean squared error. In other words, there 
may be reason to pursue models like (13) with d <p 
even when they are "incorrect." 

2.4 Partial Least Squares 

Like Christensen, I have had misgivings about par- 
tial least squares and found its apparent popularity 
in some areas, particularly chemometrics, to be a 
bit curious. However, the developments in the arti- 
cle are causing me to reconsider. In Section 7.4 of 
the article I developed a connection between OLS 
and the inverse model (17). Here I establish a con- 
nection with partial least squares via the popula- 
tion relationship between OLS and model (13) with 
d<p. 

For notational convenience, let M = + /3 . 
Var(fy)/3'^ and C = Cov(X,y). Suppose that we 
wish to estimate the population OLS coefficient vec- 
tor B = S"^Cov(X,y). From Proposition 4, E = 



Fofior^ + rMr^, and thus M"^ = (r'^SF)"^ and 
S"^ = ro{nl)-^T^ + FM-^r^. Now, 

B = s^^r/3Cov(fy,y) 
= r(r^i]F)-i/3Cov(fy,y) 

= -fr{s)B 

= r(r^i]F)~iCov(x,y). 

This says that B G 5r and that we can construct 
a marginal estimator of B by projecting the usual 
moment estimate B onto Sr relative to the 5] inner 
product. 

The population version of the partial least squares 
coefficients follows this same pattern Bpis = ^k(s)B 
(Helland, 1992; Helland and Alm0y, 1994; Naik and 
Tsai, 2000), except that Sr is replaced by the cyclic 
subspace 5k spanned by K = (C, I]C, 5]^C, . . . , 
S''~^C) for some integer q that is often chosen by 
cross-validation in practice. Assume that C can be 
written as a linear combination of at most q eigen- 
vectors of S. Then B G <Sk (Naik and Tsai, 2000) 
and consequently the marginal estimator of B from 
model (13) and the PLS estimator have the same ba- 
sic form, but differ on the method of estimating an 
upper bound — 5k or 5r — for span(B). Partial least 
squares estimates can be computed without inver- 
sion of S, which seems to be one of their attractions. 
In short, the proposed methods inherit support from 
the apparent reasonableness of partial least squares 
in some contexts. 

Principal components, principal fitted components, 
partial least squares and reductive methods based 
on the inverse model (13) all attempt to make use 
of information from the marginal distribution of X 
when inferring about y|X. This distinguishes them 
from methods like OLS, SIR, RMAVE and the lasso 
that apparently do not consider such information. 
The simulation results in the article show that sub- 
stantial gains over forward methods are possible when 
X contains information on y|X via S. It seems fair 
to conclude that the models considered in the article 
will be broadly reasonable at least within the class 
of regressions where X is informative about y|X. 

3. NONCONSTANT VAR(X|y) 

The development throughout the article was based 
on the assumption that Var(X|y) is constant. If 
Var(X|y) is not constant, then the reductions de- 
scribed here may no longer be sufficient, although 
they will still be functions of sufficient reductions. 
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Li and Nachtsheim describe an extreme case of this 
where the forward mean function E(y |X) is quadratic 
in one of the predictors without a hnear trend. For 
example, when E(y|X) = Xf and Xi is symmetri- 
cally distributed about 0, E(Xi|y) = while 
Var(Xi|y) will be nonconstant. The potential for 
this kind of setting can be detected by using a nu- 
merical diagnostic for heteroscedasticity in the con- 
text of fitting a simple linear model to ^jl^, j = 
l,...,p. 

B. Li has taken an interesting first step in the 
study of models that allow for nonconstant Var (Xjy) . 
His Theorem 2.2 shows that we can construct suffi- 
cient reductions for both the conditional mean and 
the conditional variance, and thus cover settings like 
that described by L. Li and Nachtsheim. In the con- 
text of his model (5), the methods of this article 
should be useful for estimating span(ri), but they 
will not be able to identify the part of span(r2) 
that is not contained in span(ri). However, even 
if span(ri) = span(r2), there should be efficiency 
gains by considering models like B. Li's (5). These 
gains are nicely illustrated by Li's examples. Ad- 
ditionally, analyses based on heteroscedastic inverse 
models will likely encounter inference issues not con- 
sidered in the article. For instance, it may be of in- 
terest to test if span(r2) Q span(ri) or vice versa. 

4. PRINCIPAL COMPONENTS 

B. Li presented an intriguing explanation for why 
the response tends to have a higher correlation with 
the first principal component than any other compo- 
nent, but concluding that dimension reduction via 
X|y should offer substantial gains. I expect that his 
Conjecture 1.1 is correct and that it offers a par- 
tial explanation for the popularity of reduction to a 
few leading principal components. We may be able 
to gain additional insights into the situation reason- 
ing as follows, using Li's notation and model. Recall 
that X is iV(0,i;), Y = X + e and X _LL e. 

Hold fixed (5 with = 1 and S with eigenvalues 
Ai > • • • > Ap > 0, and assume that the errors e are 
normally distributed with mean and variance cr^. 
Then the squared correlation coefficient between Y 
and the first principal component vJ^X can be ex- 
pressed as 

If the eigenvalues Xj are roughly the same, then the 
magnitude of pi is controlled by a1 and the an- 
gles between /3 and the eigenvectors Vj. However, 



if (3 vi^Q and Ai is sufficiently larger than both 
A2 and ai, then p\ will be close to 1. This might be 
taken to suggest that reduction to the first principal 
component is desirable, particularly when Ai ^ A2. 

However, Y\v'iX is normally distributed with mean 
^{Y\vi X) = 0^viv'iX and variance 

Var(y|^fX)=a2 + f](/3^t;,)2A,. 

i=2 

This distribution does not depend on the value of 
Ai. Provided /? / vi, Var(y|t;fX) > Var(y|/3^X), 
reflecting the fact that ViX is not a sufficient re- 
duction and consequently that conditioning on ViX 
does not exhaust the information that X contains 
about y, regardless of the magnitude of pi. 

5. BINARY PREDICTORS 

The main thrust of my article is on normal inverse 
models, but I also suggested how the ideas could be 
applied with conditional predictors X|y from other 
families. I was particularly pleased to see that Li 
and Nachtsheim implemented an algorithm for re- 
gressions with all binary predictors, but was simul- 
taneously a bit disappointed to see that it did not 
work out as crisply as expected. Their suggestion of 
a majorization strategy to resolve the optimization 
issues is excellent and will likely produce a stable 
and practically useful algorithm. Majorization may 
not be essential for similar algorithms developed for 
principal fitted components {uy = Pfy), although it 
might still improve performance. 
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